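Initial filtering on 3 things: biallelic SNPs only (no indels), thinning so that no two sites are within 100 bp, and site quality of at least 30. Sites that fail the FILTER field are also dropped: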

vcftools \
--gzvcf PHHA.vcf.gz \
--remove-indels \
--min-alleles 2 \
--max-alleles 2 \
--thin 100 \
--minQ 30 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--out PHHA.bithin.q30

Produces file PHHA.bithin.q30.recode.vcf

Making a summary file to look at individual missingness

vcftools \
--vcf PHHA.bithin.q30.recode.vcf \
--missing-indv \
--out PHHA.bithin.q30

Produces file called PHHA.bithin.q30.imiss

256 of 256 individuals

669,819 SNPs
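The individual and SNP counts quoted in these notes can be rechecked directly from any of the VCFs. A minimal sketch, assuming a standard VCF layout (nine fixed columns before the sample columns); the helper names `vcf_cat`, `count_snps`, and `count_samples` are made up for illustration and are not part of vcftools:

```bash
# Stream a VCF to stdout, decompressing if the name ends in .gz.
vcf_cat() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Count variant records, i.e. every line that is not a header line.
count_snps() {
    vcf_cat "$1" | grep -vc '^#'
}

# Count samples: columns on the #CHROM line minus the 9 fixed VCF columns.
count_samples() {
    vcf_cat "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# Example calls (not run here):
#   count_snps PHHA.bithin.q30.recode.vcf
#   count_samples PHHA.bithin.q30.recode.vcf
```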

Just looking for natural breaks in per-individual missingness (F_MISS) here. The aim is to remove individuals that look like obvious outliers without drastically cutting certain populations or lowering within-population sampling. 0.88 looks like a decent threshold. A more aggressive option would be 0.85, which would remove another 11 individuals. We probably can’t go much lower without losing the DA and CA populations entirely.
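One way to turn a chosen threshold into a removal list is to read the F_MISS column of the .imiss file. This is a sketch rather than the exact step used here; the function name `high_missing` and the file name `remove_indv.txt` are placeholders:

```bash
# Print the IDs of individuals whose F_MISS (column 5 of a vcftools .imiss
# file) is strictly above the threshold. The header line is skipped.
high_missing() {
    awk -v thr="$2" 'NR > 1 && ($5 + 0) > (thr + 0) { print $1 }' "$1"
}

# Example calls (not run here; remove_indv.txt is a placeholder name):
#   high_missing PHHA.bithin.q30.imiss 0.88 > remove_indv.txt
#   wc -l < remove_indv.txt      # how many individuals would be dropped
```

The resulting list can be passed to vcftools with `--remove`, which excludes the individuals named in the file.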

Below showing individual totals for each population based on 0.88 threshold.

Individual depth (averaged across all loci). Filtered out individuals with >88% missingness, as identified above.

245 of 256 individuals

669,819 SNPs

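library(dplyr)
library(readr)

# Keep sites with adjusted depth below 12, retaining only chromosome and position
snp_set <- filter(full_depth, depth_adj < 12) %>%
  select(chr, pos)

write_delim(snp_set, '../filtering/sites_maxdp12.tsv', delim = '\t', col_names = FALSE)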
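A site list like this can be applied to a VCF with the vcftools `--positions` option, which keeps only the listed chromosome/position pairs. The sketch below is an assumption about how the list might be used, not a record of the command that was run; the wrapper name `keep_sites` is hypothetical, and the input VCF and output prefix are left to the caller. Whether a header line is expected in the positions file may depend on the vcftools version, so that is worth checking against the local manual.

```bash
# Keep only the sites listed in a tab-separated (chromosome, position) file.
# $1 = input VCF, $2 = site list, $3 = output prefix (all chosen by the caller).
keep_sites() {
    vcftools --vcf "$1" \
        --positions "$2" \
        --recode \
        --recode-INFO-all \
        --out "$3"
}
```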

Allelic missingness <30% (allele in minimum 172 of 245)


MAF = 0.01 (minor allele in >= 3 individuals)

15,009 SNPs


MAF = 0.02 (minor allele in >= 5 individuals)

8,993 SNPs


MAF = 0.03 (minor allele in >= 8 individuals)

6,811 SNPs


MAF = 0.04 (minor allele in >= 10 individuals)

5,617 SNPs


MAF = 0.05 (minor allele in >= 13 individuals)

4,843 SNPs


MAF = 0.06 (minor allele in >= 15 individuals)

4,276 SNPs
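The per-individual counts quoted next to each MAF appear to follow ceil(MAF × 245), e.g. 0.01 × 245 = 2.45, rounded up to 3. That formula is inferred from the listed values, not stated in the notes. A small sketch to recompute the count for any threshold (the function name `min_individuals` is made up):

```bash
# Smallest whole number of individuals >= maf * n.
# $1 = MAF threshold, $2 = number of individuals.
min_individuals() {
    awk -v maf="$1" -v n="$2" 'BEGIN {
        x = maf * n
        c = int(x)
        if (x - c > 1e-9) c++   # round up, ignoring floating-point noise
        print c
    }'
}

# Example call:
#   min_individuals 0.01 245
```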

Allelic missingness <20% (allele in minimum 196 of 245)


MAF = 0.01 (minor allele in >= 3 individuals)

8,488 SNPs


MAF = 0.03 (minor allele in >= 8 individuals)

4,030 SNPs


MAF = 0.05 (minor allele in >= 13 individuals)

2,900 SNPs
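The missingness and MAF combinations above could be produced with the vcftools `--max-missing` and `--maf` options. This is a sketch assuming the same vcftools workflow as the earlier steps; the notes do not record the actual commands. The function name `filter_miss_maf`, the output naming scheme, and `INPUT.vcf` are placeholders. Note that `--max-missing` is the minimum fraction of called genotypes (1 means no missing data allowed), so missingness below 30% corresponds to 0.7 and below 20% to 0.8; behaviour exactly at the boundary is worth checking against the vcftools manual.

```bash
# Run one missingness/MAF combination.
# $1 = input VCF, $2 = --max-missing value (fraction called), $3 = MAF.
filter_miss_maf() {
    vcftools --vcf "$1" \
        --max-missing "$2" \
        --maf "$3" \
        --recode \
        --recode-INFO-all \
        --out "${1%.vcf}.mm$2.maf$3"
}

# Example sweep (not run here; INPUT.vcf is a placeholder):
#   for maf in 0.01 0.03 0.05; do
#       filter_miss_maf INPUT.vcf 0.8 "$maf"
#   done
```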